course logistics
overall motivation
decomposing interpretability
Instructor
TAs
We will announce in-person and virtual office hours depending on the results of the poll on the syllabus. Submit your preferences by Sunday at 6pm.
Tuesday: New methods
Thursday: Case study
Friday (every ~ 2 weeks)
Slides and Readings
Assignments. Discussion, In-Class Exercises, HW.
Piazza: https://piazza.com/wisc/spring2026/sp26stat479003/home
Canvas Page: https://canvas.wisc.edu/courses/499874/modules
Late submissions are penalized 10% per day, up to 5 days. Submissions are not accepted after that.
I have curated materials to guide you through the subject. But only you can do the work to master the topic. Read, reflect, ask questions.
To learn deeply, it helps to study the same object from many angles. I hope to share learning techniques that you can use beyond this class.
I think this topic is important in the real world. We need knowledgeable and wise data scientists who can build interpretable ML systems.
Lipton, Z. C. (2018). The Mythos of Model Interpretability. ACM Queue: Tomorrow’s Computing Today, 16(3), 31–57. https://doi.org/10.1145/3236386.3241340
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences of the United States of America, 116(44), 22071–22080. https://doi.org/10.1073/pnas.1900654116
Example from (Caruana et al. 2015).
Example from (Gu et al. 2019).
Example from (DeGrave, Janizek, and Lee 2021).
In machine learning, most effort is directed towards ensuring models have good performance metrics on external benchmark data sets.
Models developed this way can score very well on easily measured criteria like accuracy or computational efficiency but still be inappropriate with respect to properties that are harder to measure.
The gap between what we want our models to achieve and what we can easily encode in performance metrics is called a formulation gap.
Would you be willing to relinquish control to the model?
The answer depends on how it manages individual cases, not just on overall accuracy.
Typical benchmarks randomly split data into training vs. test sets.
Models are often used in settings that don’t match those original training/test splits.
The use of models might itself change the distribution of the data (pneumonia example from before).
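To make the mismatch between random splits and deployment concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the two binary features (`signal`, `shortcut`), the probabilities, and the one-feature "stump" model are hypothetical and not taken from any of the cited studies.

```python
import random

def make_data(n, p_shortcut, rng):
    """Simulate binary labels with two binary features (hypothetical setup).

    signal:   agrees with the label 75% of the time in every setting.
    shortcut: agrees with the label with probability p_shortcut,
              which may differ between data collection and deployment.
    """
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        signal = y if rng.random() < 0.75 else 1 - y
        shortcut = y if rng.random() < p_shortcut else 1 - y
        rows.append(((signal, shortcut), y))
    return rows

def accuracy(rows, j):
    """Accuracy of the rule 'predict the value of feature j'."""
    return sum(x[j] == y for x, y in rows) / len(rows)

def fit_stump(rows):
    """Choose the single feature that best predicts y on the training rows."""
    return max(range(2), key=lambda j: accuracy(rows, j))

rng = random.Random(0)

# Benchmark setting: one data source, randomly split into train and test.
source = make_data(2000, p_shortcut=0.95, rng=rng)
rng.shuffle(source)
train, test = source[:1400], source[1400:]

# Deployment setting: the shortcut no longer carries information about y.
deployed = make_data(2000, p_shortcut=0.5, rng=rng)

chosen = fit_stump(train)
acc_random_split = accuracy(test, chosen)
acc_deployed = accuracy(deployed, chosen)

print("chosen feature index:", chosen)
print("random-split test accuracy:", round(acc_random_split, 3))
print("deployment accuracy:", round(acc_deployed, 3))
```

In this toy setup the random test split shares the shortcut with the training data, so it cannot reveal that the model depends on it. Only inspecting which feature the model uses, or evaluating on shifted data, exposes the problem.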
Models are often used to support discovery. This is a different task than automation.
While this is often an argument for using “white box” models, black boxes can still support discovery, e.g., by identifying similar cases in a medical diagnosis system.
Models might amplify existing biases if only test accuracy is considered.
Fairness metrics have been defined to help guard against this risk, but there is no universal metric for fairness. Interpretability can help address broader demands for transparency.
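As a small illustration of why no single fairness metric is universal, the sketch below computes two commonly used gaps on hand-made toy predictions: the difference in positive prediction rates (often called demographic parity) and the difference in true positive rates (often called equal opportunity). The group names and the labels are invented for this example and do not come from any real data set.

```python
def positive_rate(y_pred):
    """Fraction of cases that receive a positive prediction."""
    return sum(y_pred) / len(y_pred)

def true_positive_rate(y_true, y_pred):
    """Fraction of truly positive cases that receive a positive prediction."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_on_positives) / len(preds_on_positives)

# Toy labels and predictions for two hypothetical groups.
groups = {
    "A": {"y_true": [1, 1, 0, 0], "y_pred": [1, 1, 0, 0]},
    "B": {"y_true": [1, 1, 1, 0], "y_pred": [1, 0, 0, 1]},
}

a, b = groups["A"], groups["B"]

# Gap in positive prediction rates between the groups.
dp_gap = abs(positive_rate(a["y_pred"]) - positive_rate(b["y_pred"]))

# Gap in true positive rates between the groups.
eo_gap = abs(
    true_positive_rate(a["y_true"], a["y_pred"])
    - true_positive_rate(b["y_true"], b["y_pred"])
)

print("demographic parity gap:", dp_gap)
print("equal opportunity gap:", round(eo_gap, 3))
```

Both groups receive positive predictions at the same rate, so the first gap is zero, yet truly positive cases in group B are detected far less often. Which gap matters depends on the application, which is one reason transparency about model behavior is useful alongside any single metric.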
Introduce yourself to your neighbors. What is your name and degree program? What are your areas of interest? How might interpretability or explainability be helpful in the work that you do?
Then respond to [Past Experience] in the exercise sheet.
Reference (Murdoch et al. 2019) also breaks the vague concept of “interpretability” down into precise elements which can be more formally evaluated.
Together, this helps establish trust in the reliability of the results, which is important in interdisciplinary work.
It also helps protect against unintended consequences that can arise after model deployment.
Predictive Accuracy: The model-to-explain has to be accurate. There is no point “interpreting” a model that gives a poor approximation of reality.
Descriptive Accuracy: The interpretation should be faithful to the model. This is the extent to which the explanation reflects what the black box actually learned, which is not necessarily the same as what it was designed to learn.
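The two notions can be separated in a toy sketch. Assume a hypothetical setting where the intended rule is "predict 1 when the two features sum to more than 9", but the black box has actually learned to look only at the first feature. Predictive accuracy compares the black box to the truth; descriptive accuracy (here measured as simple agreement, sometimes called fidelity) compares an explanation to the black box. All functions below are invented for illustration.

```python
# Inputs: every pair (i, j) on a 10 x 10 integer grid.
grid = [(i, j) for i in range(10) for j in range(10)]

def truth(x):
    """Ground truth, which is also what the model was designed to learn."""
    return int(x[0] + x[1] > 9)

def black_box(x):
    """What the (hypothetical) model actually learned: it ignores x[1]."""
    return int(x[0] >= 5)

def explanation_intended(x):
    """An explanation that restates the intended rule."""
    return truth(x)

def explanation_faithful(x):
    """An explanation that matches the model's real behavior."""
    return int(x[0] >= 5)

def agreement(f, g):
    """Fraction of grid points on which f and g give the same output."""
    return sum(f(x) == g(x) for x in grid) / len(grid)

predictive_accuracy = agreement(black_box, truth)
fidelity_intended = agreement(explanation_intended, black_box)
fidelity_faithful = agreement(explanation_faithful, black_box)

print("predictive accuracy of black box:", predictive_accuracy)
print("descriptive accuracy, intended-rule explanation:", fidelity_intended)
print("descriptive accuracy, faithful explanation:", fidelity_faithful)
```

The explanation that restates the intended rule sounds plausible but misdescribes the model on a quarter of the inputs, while the faithful explanation has perfect descriptive accuracy even though the model it describes is only moderately accurate.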
Interpretations don’t exist in a vacuum. Like data visualizations, their complexity needs to be suitable to their audience.
For example, we might give three different explanations of the same model depending on whether we are communicating with biologists, clinicians, or statisticians.
Whether the interpretability outputs are relevant to their audience can be gauged by their adoption in specific scientific settings or how they are actually used by participants in user studies.
Reference (Doshi-Velez and Kim 2017) notes that new interpretability techniques can be evaluated at several levels.
Functionally-grounded: Define computational proxy tasks that can be measured without studying real users.
Human-grounded: Consider simplified tasks that can be solved by general audience members. This can involve crowdsourcing.
Application-grounded: Evaluate in the field with representative experts in a concrete end-use case.
These different types of evaluation can inform one another. For example, we can define new proxy tasks based on the most challenging steps for experts.
New methods that do well in computational proxies are worth investigating through user studies.
Each of these types of evaluation comes with its own performance metrics.
We will revisit this question periodically as we introduce new methods and study the contexts in which they are worth applying.